feat: tolerate immediately recoverable stream faults, improve logging #1019
Conversation
Wow... so this is illuminating. Apparently, deadlines can impact the reconnect logic of the underlying CONNECTION (not the stream). gRPC has an underlying connection retry mechanism independent of the stream reconnect. It seems that, at least in grpc-java, our deadline, as well as the fact that our testbed is now down for 5 seconds (instead of just for an instant), has had an impact on that. I'm convinced at this point that the changes to our deadlines and bringing flagd down for 5s (see resolved comment above) instead of just for a second are what broke our e2e tests. The simple solution was to change the cadence of the "unstable" containers: instead of being down for 5s and up for 5s, we go down for 5s and stay up for 20s. Considering the connection backoff algorithm linked above, this sensibly improves the reliability of the test; before this change I couldn't make the test pass, and now it passes consistently. The downside is that the tests can take longer if we are unlucky; for that reason I've updated the reconnect tests to wait a max of 240s (though they generally complete much sooner). See my changes.
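For context, a minimal sketch of the connection backoff algorithm referenced above, using the gRPC spec's default parameters (1s initial backoff, 1.6 multiplier, ±20% jitter, 120s cap); the class is illustrative, not grpc-java's internal implementation:

```java
import java.util.Random;

// Sketch of the gRPC connection backoff spec (not grpc-java internals):
// backoff = min(INITIAL * MULTIPLIER^retries, MAX), then +/- 20% jitter.
final class ConnectionBackoffSketch {
    private static final double INITIAL_BACKOFF_S = 1.0;
    private static final double MULTIPLIER = 1.6;
    private static final double JITTER = 0.2;
    private static final double MAX_BACKOFF_S = 120.0;
    private static final Random RANDOM = new Random();

    static double backoffSeconds(int retries) {
        double backoff = Math.min(INITIAL_BACKOFF_S * Math.pow(MULTIPLIER, retries), MAX_BACKOFF_S);
        double jitter = (RANDOM.nextDouble() * 2 - 1) * JITTER * backoff;
        return backoff + jitter;
    }

    public static void main(String[] args) {
        // After a few failed connect attempts, the next reconnect can already
        // be several seconds out, which is why a container that is only up
        // for 5s at a time can repeatedly be missed by the channel.
        for (int i = 0; i < 6; i++) {
            System.out.printf("attempt %d: ~%.1fs%n", i, backoffSeconds(i));
        }
    }
}
```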
Could you inject your backoff stuff into the
Approved, but consider: #1019 (comment)
@guidobrei I updated this by rebasing on main, sorry!
Great work! Nice to see this improvement. I will soon use this as the basis for some new specifications in the flagd provider specs.
This PR uses `FlagdOptions.getRetryBackoffMs()` to initialize the backoff in all Backoff scenarios; `GrpcStreamConnector` previously used a hardcoded value of 2 seconds. It also adds an immediate, silent first reconnect attempt to `GrpcStreamConnector`. This removes a backoff when a planned deadline exceeds and the connector reconnects.

Fixes #1010
Notes
Unlike #1010, error logs are not written when the max retry delay is reached, but already at the second error in a row. Waiting for the max retry delay (120 seconds) with exponential backoff starting at 2 seconds would take 126 seconds (2 + 4 + 8 + 16 + 32 + 64 = 126) before the first error became visible.
Instead, error logs are generated whenever an error queue payload is emitted. Only on the first error do we try to reconnect immediately, without any backoff (only with the default jitter of max 250ms) and without emitting an error payload. Starting with the second error in a row, we log an error and emit the error payload.
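For illustration, a minimal sketch of that policy; the class and method names here are hypothetical, not this PR's actual API. Only the 1s default initial backoff, the 250ms jitter cap, and the 120s maximum are taken from the numbers in this PR.

```java
// Illustrative sketch of the logging/backoff policy described above; the
// class, fields, and methods are hypothetical names, not the PR's actual API.
final class SilentFirstRetrySketch {
    private int consecutiveErrors = 0;
    private long nextBackoffMs = 1_000; // FlagdOptions.getRetryBackoffMs() default
    private static final long MAX_BACKOFF_MS = 120_000;
    private static final long MAX_JITTER_MS = 250;

    // Returns the delay before the next reconnect attempt.
    long onStreamError(String cause) {
        consecutiveErrors++;
        if (consecutiveErrors == 1) {
            // First fault in a row: reconnect immediately (jitter only),
            // no error log, no error payload on the queue.
            return (long) (Math.random() * MAX_JITTER_MS);
        }
        // Second consecutive fault onwards: log, emit the error payload,
        // and back off exponentially.
        System.err.println("stream error, backing off: " + cause);
        long delay = nextBackoffMs + (long) (Math.random() * MAX_JITTER_MS);
        nextBackoffMs = Math.min(nextBackoffMs * 2, MAX_BACKOFF_MS);
        return delay;
    }

    void onStreamRecovered() {
        consecutiveErrors = 0;
        nextBackoffMs = 1_000; // reset to the configured initial backoff
    }
}
```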
The initial backoff is now `FlagdOptions.getRetryBackoffMs()` in `GrpcStreamConnector` (new) and `GrpcConnector` (no change). For the `GrpcStreamConnector` this means an initial backoff of 1 sec (default option) instead of 2 secs.
I've also removed the special handling of `DEADLINE_EXCEEDED` errors, as the connector now tries to reconnect silently on any first error. This also solves `DEADLINE_EXCEEDED` issues related to Envoy, where a wrong gRPC status code is reported. See here.
With the first immediate retry, the new backoff times for `GrpcStreamConnector` are now:
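A short sketch deriving those delays, assuming an immediate first retry (jitter only) followed by doubling from the 1s default up to the 120s cap; this is derived from the numbers above, not taken from the PR's code:

```java
// Derives the reconnect delays implied by the description above: immediate
// first retry, then exponential doubling from retryBackoffMs (1s default)
// capped at the 120s max retry delay. Jitter (up to 250 ms) is omitted.
public final class BackoffTimesSketch {
    public static void main(String[] args) {
        final long initialMs = 1_000;  // FlagdOptions.getRetryBackoffMs() default
        final long maxMs = 120_000;    // max retry delay
        System.out.println("retry 1: immediate (jitter up to 250 ms only)");
        long delay = initialMs;
        for (int retry = 2; retry <= 10; retry++) {
            System.out.printf("retry %d: %d ms%n", retry, delay);
            delay = Math.min(delay * 2, maxMs);
        }
        // Prints 1000, 2000, 4000, ..., 64000, then stays at 120000.
    }
}
```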